Excel & R

Quantitative Methodology (UPF)

Jordi Mas Elias

https://www.jordimas.cat/

Summary

  • Types of data files
  • Excel Workflow
  • Collaborative spreadcheets
  • Dynamic tables in Excel
  • R Workflow
  • Objects in R
  • Functions in R

Types of data files

Types of data files

Plain files:

  • .csv, .csv2, .tsv, .txt.

Program files:

  • Excel: .xlsx, .xls.
  • R: .Rdata.
  • Stata: .dta.
  • SPSS: .sav, .spss.

Types of data files

Comma Separated Values (csv).

  • Comas (,) separate variables.
  • Points (.) separate decimals.

Types of data files

Comma Separated Values 2 (csv).

  • Semi-colons (;) separate variables.
  • Commas (,) separate decimals.

Types of data files

Tab-Separated Values 2 (tsv).

  • Tabulators (        ) separate variables.
  • Points (.) separate decimals.

Types of data files

Other delimiters (txt).

  • Forward slash (/) separate variables.
  • Points (.) separate decimals.

Excel

Excel workflow

  • Cells, columns files.
  • Reference a cell.
  • Import data.
  • Basic operations.
    • Equal = symbol.
    • Functions (AVERAGE, SUM, MAX, MIN).
    • Scentific notation: E+16.
    • Drag cells.
    • Dollar $ symbol to impeed draging cells.

Collaborative spreadcheets

World University Rankings 2023 (Times Higher Education).

  • International students.
  • Ratio student per staff.
  • Female students.
  • Full Time Equivalent (FTE) students.

Dynamic tables in Excel

Google Spreadsheets:

  • Data -> Create a Filter.

Microsoft Excel

  • Insert -> Table

R

RStudio workflow

  1. Create a project
  2. Four screens:
  • Files: getwd(), dir()
  • Console: 1:10, sample(1:10, 2), d <- 1
  • Script: #, space.
  • Environment: objects(), ls(), search()

Packages

  • Install: Once in a year.
install.packages(c("dplyr", "ggplot", "tidyr", 
                   "readr", "readxl", "haven", "foreign"))
  • Load: Everytime you join R.
library(dplyr)
library(ggplot2)
library(readr)

R objects

Objects (I)

  • Values (A)
  • Vectors (B)
  • Dataframes (C)

Objects (II). General rules

General rules for creating objects:

  • Can’t start with a number.
  • Can’t contain: ^, !, $, @, +, -, /, *.
  • Overwrittes if saving with the same name.
  • Case sensitive.

Objects (III). Values

  • Numbers: Decimals with a dot (.).
1
5.9
4204455
  • Character (strings): Case sensitive.
"D"
"Biden"
"Torroella de Mongrí"`
  • Logical: Only two values.
TRUE
FALSE

Objects (IV): Vectors

Vector structure

A vector is formed by a chain of values. It is created with the function c().

c(value1, value2, value3, value4...)
  • Numeric
numeric <- c(4, 5, 6, 7, 8)
num_dec <- c(54.98, 55.97, 76.58, 47.33, 82.24)
numeric / 2
num_dec / numeric
  • Integer
integer_L <- c(3L, 6L, 7L, 5L, 9L)
integer_F <- as.integer(c(2022, 2021, 2021, 2021, 2020))
  • Character
character <- c("Colombia", "Japan", "Germany", "Chile", "New Zealand")
mingle <- c("4$", "אַפֿגהאַניסטאַן", "البانیہ", "%%16", "漢字%3")
  • Factor
factor <- factor(c("America", "Asia", "Europe", "America", "Oceania"))
  • Logical
logic <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
  • Data
date <- as.Date(c("2022/05/29", "2021/10/31", "2021/09/26", "2021/11/21", "2020/10/17"))

Objects (V): Dataframes

Dataframe structure

A dataframe is formed by several vectors. It is created with the function tibble().

tibble(vector1, vector2, vector3, vector4...)

Two ways of creating a dataframe:

  1. Save the vectors, then create the df
country <- c("Colombia", "Japan", "Germany", "Chile", "New Zealand")
year <- as.integer(c(2022, 2021, 2021, 2021, 2020))
date <- as.Date(c("2022/05/29", "2021/10/31", "2021/09/26", "2021/11/21", "2020/10/17"))
turnout <- c(54.98, 55.97, 76.58, 47.33, 82.24)
continent <- factor(c("America", "Asia", "Europe", "America", "Oceania"))
presidential <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

elections <- tibble(country, year, date, turnout, continent, presidential)
  1. Do it all in the same procedure.
elections <- tibble(country = c("Colombia", "Japan", "Germany", "Chile", "New Zealand"), 
                    year = as.integer(c(2022, 2021, 2021, 2021, 2020)), 
                    date = as.Date(c("2022/05/29", "2021/10/31", "2021/09/26", "2021/11/21", "2020/10/17")), 
                    turnout = c(54.98, 55.97, 76.58, 47.33, 82.24), 
                    continent = factor(c("Europe", "Asia", "Europe", "America", "Oceania")), 
                    presidential = c(TRUE, FALSE, FALSE, TRUE, FALSE))

Objects (VI): Selection

Vector in a dataframe

elections$turnout
## [1] 54.98 55.97 76.58 47.33 82.24

Use of brackets

elections$turnout[1]
## [1] 54.98
elections$turnout[c(1,3)]
## [1] 54.98 76.58
elections[1,]
## # A tibble: 1 × 6
##   country   year date       turnout continent presidential
##   <chr>    <int> <date>       <dbl> <fct>     <lgl>       
## 1 Colombia  2022 2022-05-29    55.0 Europe    TRUE
elections[,c(1,4)]
## # A tibble: 5 × 2
##   country     turnout
##   <chr>         <dbl>
## 1 Colombia       55.0
## 2 Japan          56.0
## 3 Germany        76.6
## 4 Chile          47.3
## 5 New Zealand    82.2
elections[elections$presidential,]
## # A tibble: 2 × 6
##   country   year date       turnout continent presidential
##   <chr>    <int> <date>       <dbl> <fct>     <lgl>       
## 1 Colombia  2022 2022-05-29    55.0 Europe    TRUE        
## 2 Chile     2021 2021-11-21    47.3 America   TRUE

R functions

Functions (I)

Function structure

A function is formed by one or several arguments. It is created with the function function().

function(argument1, argument2 ...)

We can easily create a function:

per_four <- function(x) {x * 4}

And apply it to any object:

per_four(2) #to a value
## [1] 8
per_four(elections$turnout) #to a vector
## [1] 219.92 223.88 306.32 189.32 328.96

Functions (II): Import functions

  • Package readr.
read_csv("files/gapminder.csv")
read_csv2("files/gapminder2.csv")
read_tsv("files/gapminder3.tsv")
read_delim("files/gapminder4.txt", delim = "/")
  • Other packages:
tibble(foreign::read.dta("files/gapminder5.dta"))
load("files/gapminder6.Rdata")
tibble(read.spss("files/gapminder7.sav", to.data.frame = T))

Functions (III): With one argument

Applied normally to a dataframe:

glimpse()
dim()
summary()

Applied normally to a vector:

#to a character vector
unique()
table()

#to a numeric vector
mean() 
hist()

Functions (IV): With many arguments

Help!

R is impossible to use without help.

  • Use ? before a function: e.g. ?sample.
  • Use Google.
  • Use Cheet Sheats: e.g. Dplyr.
  • Use Stackoverflow.

Some functions with (normally) more than one argument:

sample()
seq()
rep()

Functions (IV): Without arguments

Normally, when they are related to the working environment.

ls()
installed.packages()
search()
getwd()

Import data to Rv

Vectors and data frames in R

Objects and functions in R